Generalized linear models (GLMs) are a class of regression models developed to handle response variables whose errors are not normally (Gaussian) distributed. Common examples include binary responses (e.g., 0 and 1) and count data, which are discrete, non-negative integers. Transforming the data before fitting a linear model is an option, but it can be difficult to properly meet the model assumptions that way. We will show examples of how to perform data transformations as well as GLMs appropriate to the properties of the data.
We will first analyze data that came from a trial of different insecticide applications and the resultant number of insects. The four sprays A, B, C, and F are the focus of this analysis and are filtered from the original dataset. We will approach this analysis just as we did in the previous ANOVA lecture.
## Rows: 72 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): spray
## dbl (1): count
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Count_Data <- Count_Data %>%
filter(spray == 'A' | spray == 'B' | spray == 'C' | spray == 'F') %>%
droplevels()
Count_Data$spray <- as.factor(Count_Data$spray)

It is always a good idea to visualize the data prior to the analysis. We are comparing insect counts among the different treatment types and looking for the treatment with the fewest insects observed, which would indicate the most effective treatment.
ggplot(Count_Data, aes(x = spray, y = count)) +
geom_boxplot(outlier.shape = NA, width = 0.5) +
geom_jitter(height = 0, width = 0.1, size = 3, pch = 21) +
labs(x = "Spray Treatment Type", y = "Count") +
theme_classic() +
theme(axis.title = element_text(face = "bold", size = 15),
axis.text = element_text(face = "bold", size = 15))

When we work with certain types of data - count data in this instance - linear models often fail to meet their assumptions. A visual inspection of the data shows that the distribution is non-negative and not normally distributed. Keep in mind, however, that the normality assumption of a linear model is not about the response variable but about the errors (residuals) of the model. A histogram can help guide which transformations might be helpful for meeting the linear model assumptions.
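The histogram itself is not echoed in the code above; it was likely produced with something along these lines (base R `hist()`; the number of breaks is an assumption):

```r
# Right skew and the hard floor at zero in this histogram suggest a
# transformation or a count-appropriate GLM may be needed
hist(Count_Data$count, breaks = 15,
     xlab = "Insect Count", main = "Distribution of Insect Counts")
```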
First we will fit a linear model with the non-normal data to show the distribution of the residuals.
lm1 <- lm(count ~ spray, data = Count_Data)
Anova(lm1, type = 2) # Appropriate since there is no interaction effect
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| spray | 1648.729 | 3 | 26.47836 | 0 |
| Residuals | 913.250 | 44 | NA | NA |
plot(resid(lm1) ~ fitted(lm1)) # funnel shapes or curvature are signs of heteroskedasticity or a trend in the residuals
abline(h = 0)
## [1] 45 46
##
## Shapiro-Wilk normality test
##
## data: resid(lm1)
## W = 0.96838, p-value = 0.2188
| | Df | F value | Pr(>F) |
|---|---|---|---|
| group | 3 | 2.711731 | 0.0563515 |
| | 44 | NA | NA |
Based on the boxplot, the variation within each spray treatment is not the same, and the model nearly fails the homogeneity of variance assumption. This is not the worst example of count data failing to meet the normally distributed residuals assumption, but we will move forward with the other modeling approaches to illustrate the lesson.
To model the effect of the different sprays on insect counts in a way that more closely meets the normally distributed residuals assumption, we can fit a log-linear model by applying the natural log transformation to the response (count). You need to add a small value to the variable before the transformation because log(0) is undefined. Any small constant works, but +1 is convenient because zeros map back to zero after the transformation.
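The fitting code is not echoed here; judging from the Call line printed later in the summary output, it was presumably:

```r
# log-linear model: natural log of (count + 1) as the response
lm2 <- lm(log(count + 1) ~ spray, data = Count_Data)
Anova(lm2, type = 2)  # car::Anova, as with lm1
```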
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| spray | 29.365095 | 3 | 56.68159 | 0 |
| Residuals | 7.598377 | 44 | NA | NA |
## integer(0)
## [1] 27 25
##
## Shapiro-Wilk normality test
##
## data: resid(lm2)
## W = 0.97753, p-value = 0.4804
| | Df | F value | Pr(>F) |
|---|---|---|---|
| group | 3 | 3.051575 | 0.0382753 |
| | 44 | NA | NA |
The transformation makes the variance of the residuals a little more consistent across treatment types according to the boxplot, but the model still fails the Levene test while passing the Shapiro-Wilk test for normally distributed residuals. One more step before we get to the main lesson.
A generalized linear model works similarly to the traditional linear model we have been covering, in that we estimate the relationship between the explanatory variables and the response variable, but there are two new elements: the error distribution (family) and the link function. The family allows the response to follow a distribution other than the normal (e.g., binomial, Poisson, etc.), which modifies the model assumptions, while the link function relates the mean of the response to the linear predictor. There is specific syntax in the model functions to specify the appropriate family and link based on the type of data analyzed. For this example - we have count data - the Poisson distribution is often what is tried first in a generalized linear model.
Now we will use GLMs to examine the effect of the different sprays on the counts of insects. The glm() function is a general function that fits a generalized linear model by specifying the ‘family’ (error distribution). The default setting is ‘Gaussian’ distribution (normal).
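The Poisson fit is not echoed here; based on the Call line in the summary output further down, it was presumably:

```r
# Poisson GLM; the log link is the default for family = poisson
glm1 <- glm(count ~ spray, data = Count_Data,
            family = poisson(link = "log"))
Anova(glm1, type = 2)  # for GLMs, car::Anova reports a likelihood ratio chi-square by default
```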
| | LR Chisq | Df | Pr(>Chisq) |
|---|---|---|---|
| spray | 185.8313 | 3 | 0 |
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| spray | 185.83131 | 3 | 35.83224 | 0 |
| Residuals | 76.06351 | 44 | NA | NA |
For GLMs, Anova returns a likelihood ratio test with a chi-square value. You can also get an F-statistic by requesting a different test statistic when running the ANOVA on the glm. Either way, the likelihood ratio chi-square test tells you the same thing the F-test does in this scenario - there is significant variation among the different levels of spray.
We can call for the summary of the model results showing the estimates for the different sprays - notice they are on the log scale, not the response scale, due to the log link for the Poisson distribution. We still need to review the model assumptions.
##
## Call:
## lm(formula = count ~ spray, data = Count_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3333 -2.3750 -0.5833 2.0625 9.3333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.5000 1.3152 11.025 0.0000000000000302 ***
## sprayB 0.8333 1.8599 0.448 0.656
## sprayC -12.4167 1.8599 -6.676 0.0000000341855369 ***
## sprayF 2.1667 1.8599 1.165 0.250
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.556 on 44 degrees of freedom
## Multiple R-squared: 0.6435, Adjusted R-squared: 0.6192
## F-statistic: 26.48 on 3 and 44 DF, p-value: 0.0000000006091
##
## Call:
## lm(formula = log(count + 1) ~ spray, data = Count_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95261 -0.25946 0.01131 0.32350 1.12683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6967 0.1200 22.480 < 0.0000000000000002 ***
## sprayB 0.0598 0.1696 0.353 0.726
## sprayC -1.7441 0.1696 -10.281 0.000000000000283 ***
## sprayF 0.1189 0.1696 0.701 0.487
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4156 on 44 degrees of freedom
## Multiple R-squared: 0.7944, Adjusted R-squared: 0.7804
## F-statistic: 56.68 on 3 and 44 DF, p-value: 0.000000000000003701
##
## Call:
## glm(formula = count ~ spray, family = poisson(link = "log"),
## data = Count_Data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.67415 0.07581 35.274 <0.0000000000000002 ***
## sprayB 0.05588 0.10574 0.528 0.597
## sprayC -1.94018 0.21389 -9.071 <0.0000000000000002 ***
## sprayF 0.13926 0.10367 1.343 0.179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 262.154 on 47 degrees of freedom
## Residual deviance: 76.323 on 44 degrees of freedom
## AIC: 273.93
##
## Number of Fisher Scoring iterations: 5
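Because of the log link, the Poisson coefficients are additive on the log scale and can be back-transformed to counts with exp(). A quick check using the estimates printed in the summary above:

```r
# estimates copied from the Poisson model summary (log scale)
b <- c(intercept = 2.67415, sprayB = 0.05588,
       sprayC = -1.94018, sprayF = 0.13926)

exp(b[["intercept"]])                  # ~14.5: mean count for spray A (the reference level)
exp(b[["intercept"]] + b[["sprayC"]])  # ~2.08: mean count for spray C

# with the fitted model object, the same idea is exp(coef(glm1))
```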
Unlike linear models, GLMs do not assume normally distributed residuals. The assessment of residuals is used to check model fit and independence, rather than normality, so the ‘better’ fit here refers to appropriately modeling the variance structure of the count data.
The results from the assessment of the residuals for the linear model indicates some non-constant variance in the residuals and minor deviation in the QQ-plot. The log-transformed count model didn’t seem to help much either. The generalized linear model seemed to handle things well. This is a better approach to model this type of data.
We compare the emmeans among the different spray treatments for each of the linear models (un-transformed, log-transformed, and Poisson).
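The call for the untransformed model is not echoed; presumably it was simply:

```r
library(emmeans)
emmeans(lm1, ~ spray)  # adjusted means on the raw count scale
```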
## spray emmean SE df lower.CL upper.CL
## A 14.50 1.32 44 11.849 17.15
## B 15.33 1.32 44 12.683 17.98
## C 2.08 1.32 44 -0.567 4.73
## F 16.67 1.32 44 14.016 19.32
##
## Confidence level used: 0.95
emmeans(lm2, ~ spray, type = 'response') # the type argument back-transforms the values to the original scale
## spray response SE df lower.CL upper.CL
## A 13.83 1.780 44 10.65 17.9
## B 14.75 1.890 44 11.36 19.1
## C 1.59 0.311 44 1.04 2.3
## F 15.70 2.000 44 12.12 20.3
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log(mu + 1) scale
## spray rate SE df asymp.LCL asymp.UCL
## A 14.50 1.100 Inf 12.50 16.82
## B 15.33 1.130 Inf 13.27 17.72
## C 2.08 0.417 Inf 1.41 3.08
## F 16.67 1.180 Inf 14.51 19.14
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
Familiar code to create the results for the multiple comparisons tests.
| | spray | emmean | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|
| 3 | C | 2.083333 | 1.315158 | 44 | -0.5671931 | 4.73386 | a |
| 1 | A | 14.500000 | 1.315158 | 44 | 11.8494735 | 17.15053 | b |
| 2 | B | 15.333333 | 1.315158 | 44 | 12.6828069 | 17.98386 | b |
| 4 | F | 16.666667 | 1.315158 | 44 | 14.0161402 | 19.31719 | b |
emmeans2 <- lm2 %>%
emmeans(specs = "spray", type = "response") %>%
cld(adjust = "none", Letters = letters)
emmeans2
| | spray | response | SE | df | lower.CL | upper.CL | .group |
|---|---|---|---|---|---|---|---|
| 3 | C | 1.592459 | 0.3109964 | 44 | 1.035699 | 2.301491 | a |
| 1 | A | 13.831278 | 1.7791887 | 44 | 10.646095 | 17.887602 | b |
| 2 | B | 14.745307 | 1.8888374 | 44 | 11.363826 | 19.051616 | b |
| 4 | F | 15.704220 | 2.0038704 | 44 | 12.116801 | 20.272789 | b |
emmeans3 <- glm1 %>%
emmeans(specs = "spray", type = "response") %>%
cld(Letters = letters)
emmeans3
| | spray | rate | SE | df | asymp.LCL | asymp.UCL | .group |
|---|---|---|---|---|---|---|---|
| 3 | C | 2.083333 | 0.4166664 | Inf | 1.407727 | 3.083181 | a |
| 1 | A | 14.500000 | 1.0992422 | Inf | 12.497944 | 16.822767 | b |
| 2 | B | 15.333333 | 1.1303883 | Inf | 13.270435 | 17.716910 | b |
| 4 | F | 16.666667 | 1.1785113 | Inf | 14.509743 | 19.144225 | b |
emmeans3_diff <- glm1 %>%
emmeans(specs = "spray", type = "response") %>%
cld(adjust = "none", Letters = letters, details = T)
emmeans3_diff
## $emmeans
## spray rate SE df asymp.LCL asymp.UCL .group
## C 2.08 0.417 Inf 1.41 3.08 a
## A 14.50 1.100 Inf 12.50 16.82 b
## B 15.33 1.130 Inf 13.27 17.72 b
## F 16.67 1.180 Inf 14.51 19.14 b
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
## Tests are performed on the log scale
## significance level used: alpha = 0.05
## NOTE: If two or more means share the same grouping symbol,
## then we cannot show them to be different.
## But we also did not show them to be the same.
##
## $comparisons
## contrast ratio SE df null z.ratio p.value
## A / C 6.96 1.490 Inf 1 9.071 <.0001
## B / C 7.36 1.570 Inf 1 9.364 <.0001
## B / A 1.06 0.112 Inf 1 0.528 0.5972
## F / C 8.00 1.700 Inf 1 9.803 <.0001
## F / A 1.15 0.119 Inf 1 1.343 0.1792
## F / B 1.09 0.111 Inf 1 0.816 0.4144
##
## Tests are performed on the log scale
If you want to know the exact difference between the average counts for the different treatment levels, you can call for the details in the cld function. Note that because the comparisons are back-transformed from the log scale, they are ratios rather than differences: the F treatment averaged about 8 times as many insects as the C treatment.
emmeans1 <- emmeans1 %>%
as_tibble()
emmeans3 <- emmeans3 %>%
as_tibble()
Plot_lm1 <- ggplot() +
geom_point(data = Count_Data, aes(y = count, x = spray), position = position_jitter(width = 0.1), alpha = 0.5) + # dots representing the raw data
geom_boxplot(data = Count_Data, aes(y = count, x = spray), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
geom_point(data = emmeans1, aes(y = emmean, x = spray), position = position_nudge(x = 0.15), size = 2, color = "red") + # red dots representing the adjusted means
geom_errorbar(data = emmeans1, aes(ymin = lower.CL, ymax = upper.CL, x = spray), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
geom_text(data = emmeans1, aes(y = emmean, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # grouping letters
labs(y = "Counts", x = "Treatment") +
theme_classic()
Plot_lm1

Plot_glm1 <- ggplot() +
geom_point(data = Count_Data, aes(y = count, x = spray), position = position_jitter(width = 0.1), alpha = 0.5) + # dots representing the raw data
geom_boxplot(data = Count_Data, aes(y = count, x = spray), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
geom_point(data = emmeans3, aes(y = rate, x = spray), position = position_nudge(x = 0.15), size = 2, color = "red") + # red dots representing the adjusted means
geom_errorbar(data = emmeans3, aes(ymin = asymp.LCL, ymax = asymp.UCL, x = spray), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
geom_text(data = emmeans3, aes(y = rate, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # grouping letters
labs(y = "Counts", x = "Treatment") +
theme_classic()
Plot_glm1

We can interpret the results from the analysis similarly between the linear model and the generalized linear model using the Poisson distribution. The main difference that we can see in the data visualization is the size of the confidence intervals showing the mean separation between the different treatments.
We can see that the C spray treatment had the lowest number of insects compared to each other treatment. The difference was between ~ 7 to 8 insects fewer.
The Poisson distribution is often used to model count data, but it carries a specific condition that does not always hold for real count data: the mean and the variance are assumed to be equal. When the variance is larger than the mean (overdispersion), the data may be better described by a negative binomial distribution. We need to assess whether the variance exceeds the mean enough to decide between fitting the model with a Poisson distribution or a negative binomial distribution. We will review the steps to assess these characteristics of count data.
The general approach is to fit a model with the Poisson distribution, check for overdispersion, and refit with the negative binomial distribution if overdispersion is present.
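As a self-contained sketch of that workflow on simulated data (glm.nb() from MASS is the refitting function used later in this lecture; the simulation settings are arbitrary):

```r
library(MASS)  # glm.nb()

set.seed(1)
# simulate overdispersed counts: negative binomial with a small size parameter
df <- data.frame(trt = rep(c("A", "B"), each = 50))
df$y <- rnbinom(100, mu = ifelse(df$trt == "A", 5, 10), size = 2)

pois_mod <- glm(y ~ trt, data = df, family = poisson)

# dispersion ratio = Pearson chi-square / residual df; values well above 1
# flag overdispersion (performance::check_overdispersion() reports the same quantity)
sum(residuals(pois_mod, type = "pearson")^2) / df.residual(pois_mod)

nb_mod <- glm.nb(y ~ trt, data = df)  # refit with a negative binomial family
AIC(pois_mod, nb_mod)                 # the NB model should fit these data better
```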
## Rows: 1300 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): block, trt, spray, lead
## dbl (3): row, col, y
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| row | col | y | block | trt | spray | lead |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | B1 | T1 | N | N |
| 2 | 1 | 0 | B1 | T1 | N | N |
| 3 | 1 | 1 | B1 | T1 | N | N |
| 4 | 1 | 3 | B1 | T1 | N | N |
| 5 | 1 | 6 | B1 | T1 | N | N |
| 6 | 1 | 0 | B2 | T1 | N | N |
ggplot(data, aes(x = spray, y = y, fill = lead)) +
geom_violin(scale = "width", adjust = 2) + # produces the violin geometry that shows the distribution of the data
geom_point(position = position_jitterdodge(jitter.width = 0.5, # jitters the horizontal positioning of the data
jitter.height = 0.1, # jitters the vertical positioning of the data - not always advised
dodge.width = 1), # shifts the horizontal grouping of the data
alpha = 0.3) +
theme_classic(base_size = 14) # changes the base text size across the whole plot

##
## Call:
## glm(formula = y ~ spray * lead, family = "poisson", data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.33647 0.04688 7.177 0.000000000000712 ***
## sprayY -1.02043 0.09108 -11.204 < 0.0000000000000002 ***
## leadY -0.49628 0.07621 -6.512 0.000000000074139 ***
## sprayY:leadY 0.29425 0.13917 2.114 0.0345 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1955.9 on 1299 degrees of freedom
## Residual deviance: 1720.4 on 1296 degrees of freedom
## AIC: 3125.5
##
## Number of Fisher Scoring iterations: 6
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| spray | 142.348722 | 1 | 105.076131 | 0.0000000 |
| lead | 43.721150 | 1 | 32.273203 | 0.0000000 |
| spray:lead | 4.452271 | 1 | 3.286489 | 0.0700835 |
| Residuals | 1755.716946 | 1296 | NA | NA |
## # Overdispersion test
##
## dispersion ratio = 1.355
## Pearson's Chi-Squared = 1755.717
## p-value = < 0.001
## Overdispersion detected.
A significant result (p-value < 0.05) provides evidence of overdispersion.
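The negative binomial refit that produced the next table is not echoed; assuming the same formula as the Poisson model, it was presumably along these lines (glm.nb() from MASS):

```r
# manual version of the dispersion ratio reported above: 1755.717 / 1296 ≈ 1.355
sum(residuals(mod1, type = "pearson")^2) / df.residual(mod1)

# negative binomial refit; estimates a theta parameter to absorb the extra variance
mod2 <- MASS::glm.nb(y ~ spray * lead, data = data)
car::Anova(mod2, type = 2, test.statistic = "F")
```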
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| spray | 98.112727 | 1 | 100.931739 | 0.0000000 |
| lead | 28.129281 | 1 | 28.937502 | 0.0000001 |
| spray:lead | 3.386873 | 1 | 3.484186 | 0.0621835 |
| Residuals | 1259.802870 | 1296 | NA | NA |
Other methods to compare residuals for the Poisson and negative binomial models include functions from the DHARMa package.
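The DHARMa calls are not echoed; the standard pattern is simulateResiduals() followed by plot(), which presumably looked like:

```r
library(DHARMa)

sim_pois <- simulateResiduals(mod1, n = 1000)  # scaled residuals for the Poisson fit
plot(sim_pois)                                 # QQ plot plus residual-vs-predicted panel

sim_nb <- simulateResiduals(mod2, n = 1000)    # same for the negative binomial fit
plot(sim_nb)
```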
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
## Object of Class DHARMa with simulated residuals based on 1000 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help.
##
## Scaled residual values: 0.3917235 0.08981703 0.5502455 0.9028255 0.9987197 0.07915011 0.7648524 0.6775526 0.3794544 0.8701158 0.1695846 0.881399 0.3230802 0.169781 0.635007 0.08515065 0.2523009 0.9841811 0.90124 0.8278879 ...
## Object of Class DHARMa with simulated residuals based on 1000 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help.
##
## Scaled residual values: 0.3489687 0.2603192 0.4754614 0.8802034 0.98509 0.007354437 0.7371795 0.782935 0.3902256 0.8651646 0.2697196 0.8836759 0.631547 0.1996464 0.644313 0.307934 0.4189164 0.9364823 0.852556 0.9019994 ...
There is no sign of dependency in the Residuals vs Fitted plot, and it is not unusual to see non-normally distributed residuals here for a GLM.
Results from the model suggest that the interaction effect isn’t significant, but we can confirm with multiple comparisons tests.
Familiar code to create the results for the multiple comparisons tests.
all_means <- mod2 %>%
emmeans(~ spray:lead) %>%
cld(Letters = letters, type = "response", details = F)
all_means
| | spray | lead | response | SE | df | asymp.LCL | asymp.UCL | .group |
|---|---|---|---|---|---|---|---|---|
| 4 | Y | Y | 0.4123077 | 0.0391105 | Inf | 0.3423564 | 0.4965517 | a |
| 2 | Y | N | 0.5046154 | 0.0440863 | Inf | 0.4252010 | 0.5988620 | a |
| 3 | N | Y | 0.8523077 | 0.0611373 | Inf | 0.7405229 | 0.9809668 | b |
| 1 | N | N | 1.4000000 | 0.0855387 | Inf | 1.2419967 | 1.5781041 | c |
all_means <- all_means %>%
as_tibble()
Plot_NB <- ggplot() +
geom_point(data = data, aes(y = y,x = spray),position = position_jitter(width = 0.1)) + # dots representing the raw data
facet_wrap(~ lead) +
geom_boxplot(data = data, aes(y = y, x = spray), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
geom_point(data = all_means, aes(y = response, x = spray), position = position_nudge(x = 0.15), size = 2, color = "red") + # red dots representing the adjusted means
geom_errorbar(data = all_means, aes(ymin = asymp.LCL, ymax = asymp.UCL, x = spray), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
geom_text(data = all_means, aes(y = response, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # grouping letters
labs(y = "Counts", x = "Spray Treatment") +
theme_classic()
Plot_NB

While the ANOVA test showed a marginally significant interaction effect, the post-hoc tests showed clear significant differences among the interaction levels. This can happen when there is a lot of variation within groups, so the overall ANOVA test does not reach significance even though individual pairwise contrasts do. The post-hoc comparisons only look at pairwise differences and therefore only a fraction of the overall variance in the model.
The spray treatment lowered insect counts, and the lead treatment also lowered counts in the absence of the spray treatment. There was no difference in counts between the spray treatments with or without the lead treatment.
There are instances where count data have excessive zeros and the Poisson or negative binomial models will produce biased results. If you believe that your data might have an excess number of zeros, I would suggest you look into zero-inflated or hurdle models. These models require careful consideration about the context of the “zero” observations. We won’t be covering this topic further, but if you have questions about it please feel free to email us.
Data that follow a binomial error distribution often have a binary or dichotomous response. The example dataset comes from a study of 800 observations across two pepper fields documenting the presence or absence of Phytophthora disease. The response variable is the presence or absence of the disease, and the independent variables we will review are field and soil moisture percent, to determine whether they affect the probability of observing the disease.
## Rows: 800 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): field, disease
## dbl (4): row, quadrat, water, leaf
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
| field | row | quadrat | disease | water | leaf |
|---|---|---|---|---|---|
| F1 | 1 | 1 | N | 15.05 | 5 |
| F1 | 1 | 2 | N | 14.32 | 2 |
| F1 | 1 | 3 | N | 13.99 | 3 |
| F1 | 1 | 4 | N | 13.85 | 0 |
| F1 | 1 | 5 | N | 13.16 | 5 |
| F1 | 1 | 6 | N | 11.81 | 5 |
With logistic regression, it can be more difficult to interpret the coefficients, so a visual representation of the trends can be easier to understand. In order to do that we need to create a new variable that is 0 or 1 instead of “Y” or “N”.
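The recoding step is not shown; assuming the Y/N column is data$disease, a one-liner does it:

```r
data$disease_num <- ifelse(data$disease == "Y", 1, 0)  # 1 = disease present, 0 = absent
```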
We can look at the presence of the disease across both fields.
mod1 <- glm(disease_num ~ field, data = data, family = binomial(link = "logit"))
car::Anova(mod1, type = 2, test.statistic = "F") # No interaction so type two is OK
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| field | 0.4978846 | 1 | 0.4966398 | 0.4811859 |
| Residuals | 800.0000000 | 798 | NA | NA |
##
## Call:
## glm(formula = disease_num ~ field, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8575 0.1463 -12.695 <0.0000000000000002 ***
## fieldF2 0.1423 0.2019 0.705 0.481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 658.74 on 799 degrees of freedom
## Residual deviance: 658.24 on 798 degrees of freedom
## AIC: 662.24
##
## Number of Fisher Scoring iterations: 4
autoplot(mod1) # Not too informative for this type of model - just make sure that the type of data matches the binomial regression (e.g., presence/absence).

means <- mod1 %>%
emmeans(pairwise ~ field, type = "response") %>% # the response scale will be the probability of disease presence
cld(Letters = letters)
means| field | prob | SE | df | asymp.LCL | asymp.UCL | .group |
|---|---|---|---|---|---|---|
| F1 | 0.1350 | 0.0170862 | Inf | 0.1048716 | 0.1721197 | a |
| F2 | 0.1525 | 0.0179752 | Inf | 0.1204985 | 0.1911532 | a |
The non-significant result suggests that the presence of the disease is not different between the two fields.
ggplot(data, aes(x = field, y = disease_num)) +
geom_jitter(height = 0.01) +
geom_errorbar(data = means,
aes(x = field, y = prob, ymin = (prob - SE), ymax = (prob + SE)),
width = 0.1, lwd = 1, position = position_dodge(width = 0.5)) +
geom_point(data = means, aes(x = field, y = prob),
size = 2, position = position_dodge(width = 0.5)) +
geom_text(data = means, aes(y = prob, x = field, label = str_trim(.group)), color = "black", position = position_dodgenudge(width = 1.2, x = -0.15), hjust = 0, angle = 0) +
theme_classic()
## Warning: `position_dodge()` requires non-overlapping x intervals.
There are no differences in disease prevalence between fields, but there are other variables to consider with this dataset. The other variable of interest is soil moisture percent, a potential covariate.
## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_point()`).
mod1 <- glm(disease_num ~ field * water, data = data, family = binomial(link = "logit"))
car::Anova(mod1, type = 3, test.statistic = "F")
| | Sum Sq | Df | F value | Pr(>F) |
|---|---|---|---|---|
| field | 26.671274 | 1 | 25.6585490 | 0.0000005 |
| water | 0.758459 | 1 | 0.7296598 | 0.3932534 |
| field:water | 36.854331 | 1 | 35.4549490 | 0.0000000 |
| Residuals | 818.062332 | 787 | NA | NA |
##
## Call:
## glm(formula = disease_num ~ field * water, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.61224 0.82857 -3.153 0.00162 **
## fieldF2 -5.82953 1.18699 -4.911 0.000000905 ***
## water 0.06673 0.07437 0.897 0.36953
## fieldF2:water 0.61732 0.10802 5.715 0.000000011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 648.80 on 790 degrees of freedom
## Residual deviance: 534.19 on 787 degrees of freedom
## (9 observations deleted due to missingness)
## AIC: 542.19
##
## Number of Fisher Scoring iterations: 6
| field | water.trend | SE | df | asymp.LCL | asymp.UCL | .group |
|---|---|---|---|---|---|---|
| F1 | 0.0667339 | 0.0743679 | Inf | -0.0790246 | 0.2124924 | a |
| F2 | 0.6840556 | 0.0783472 | Inf | 0.5304980 | 0.8376132 | b |
The significant result suggests that the trend line is different between the two fields. As noted earlier, logistic regression coefficients can be difficult to interpret directly, so a visual representation of the trends is easier to understand.
ggplot(data, aes(x = water, y = disease_num, color = field)) +
geom_jitter(height = 0.01) +
geom_smooth(method = "glm", method.args = list(family = "binomial"),
formula = y ~ x, se = T, lwd = 1.5) +
labs(y = "Probability of Disease", x = "Soil Moisture %") +
theme_classic()
## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_point()`).
For field 1 there is a gradual increase in the probability of observing the disease as soil moisture % increases. For field 2 the increase is much more rapid, with the probability climbing to almost 100% between 10 and 15% soil moisture.
Logistic regression can be further complicated when the response variable is coded as either multinomial (many categories) or ordinal (discrete categories along a continuum). You would use the polr() function from the MASS package to estimate the ordered logistic regression model. The function name comes from the proportional odds logistic regression, which hints at the proportional odds assumption (a.k.a. parallel regression assumption) of the model. We won’t be covering this topic further, but if you have questions about it please feel free to email us.
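As a minimal, self-contained sketch of the polr() syntax only (the variables here are hypothetical and unrelated to the pepper data):

```r
library(MASS)

set.seed(42)
# hypothetical ordinal response: disease severity rated low < medium < high
d <- data.frame(
  severity = factor(sample(c("low", "medium", "high"), 60, replace = TRUE),
                    levels = c("low", "medium", "high"), ordered = TRUE),
  moisture = runif(60, 5, 20)
)

ord_mod <- polr(severity ~ moisture, data = d, Hess = TRUE)
summary(ord_mod)  # slope is on the cumulative log-odds scale; intercepts are the cutpoints
```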
We are going to revisit the mead.lamb (Mead et al. 2002) data to use as practice. We can upload the data from the agridat package.
data(mead.lamb)
dat <- mead.lamb
# Farm: three locations
# Breed: three breeds of lamb
# Lamb Class: four levels of class (number of live lambs per birth)
# Y: count of ewes in class
# Model specification steps
mod1 <- glm(y ~ farm * breed * lambclass, data = dat, family = poisson)
mod2 <- glm(y ~ farm*breed + breed * lambclass + farm * lambclass, data = dat, family = poisson)
mod3 <- glm(y ~ farm * breed + breed * lambclass, data = dat, family = poisson)
mod4 <- glm(y ~ farm * breed + farm * lambclass, data = dat, family = poisson)
mod5 <- glm(y ~ farm * breed + lambclass, data = dat, family = poisson)
mod6 <- glm(y ~ farm + breed + lambclass, data = dat, family = poisson)
mod7 <- glm(y ~ farm * breed, data = dat, family = poisson)
AIC(mod1, mod2, mod3, mod4, mod5, mod6, mod7)
| | df | AIC |
|---|---|---|
| mod1 | 36 | 223.3074 |
| mod2 | 24 | 213.8873 |
| mod3 | 18 | 303.5781 |
| mod4 | 18 | 206.1520 |
| mod5 | 12 | 306.5457 |
| mod6 | 8 | 373.5815 |
| mod7 | 9 | 852.9800 |
# Model 4 is the best model given it has the lowest AIC value
# Should also use theory to drive what terms should be in the model and what hypotheses you want to test
# Check for overdispersion and look at the residual plots
# Then complete the multiple comparisons tests - show all pairwise comparisons for both interactions (e.g., farm:lambclass, and farm:breed)
# Finally make a pair of figures that shows the results and the interpretation of the statistical analyses
check_overdispersion(mod4)
## # Overdispersion test
##
## dispersion ratio = 1.074
## Pearson's Chi-Squared = 19.338
## p-value = 0.371
## No overdispersion detected.
# Simulate scaled (quantile) residuals for mod4 with DHARMa for a visual check
sim_res <- simulateResiduals(mod4)
sim_res

## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help.
##
## Scaled residual values: 0.4060429 0.7288599 0.3218639 0.6627756 0.6786652 0.6179234 0.3357164 0.7419022 0.6296398 0.2100956 0.8245195 0.08288212 0.8735348 0.756626 0.2582193 0.4837854 0.4102228 0.4112786 0.6805604 0.2090176 ...
# Fit the negative-binomial alternative (MASS::glm.nb) for comparison
mod4.nb <- glm.nb(y ~ farm * breed + farm * lambclass, data = dat)

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in glm.nb(y ~ farm * breed + farm * lambclass, data = dat): alternation
## limit reached
anova(mod4, mod4.nb, test = "Chisq")

| Resid. Df | Resid. Dev | Df | Deviance  | Pr(>Chi) |
|-----------|------------|----|-----------|----------|
| 18        | 18.8446    | NA | NA        | NA       |
| 18        | 18.8427    | 0  | 0.0019007 | NA       |

AIC(mod4, mod4.nb)

|         | df | AIC      |
|---------|----|----------|
| mod4    | 18 | 206.1520 |
| mod4.nb | 19 | 208.1576 |

# The negative binomial fit barely changes the deviance and has a higher AIC,
# consistent with no overdispersion being detected, so we keep the Poisson model (mod4)
all_means_FB <- mod4 %>%
  emmeans(~ farm:breed) %>%
  cld(Letters = letters, type = "response")
all_means_FB

|   | farm | breed | rate      | SE        | df  | asymp.LCL | asymp.UCL | .group |
|---|------|-------|-----------|-----------|-----|-----------|-----------|--------|
| 8 | F2   | C     | 3.583796  | 0.8192890 | Inf | 2.289553  | 5.609651  | a      |
| 9 | F3   | C     | 4.590603  | 0.9113328 | Inf | 3.110914  | 6.774097  | ab     |
| 4 | F1   | B     | 7.818046  | 1.2371518 | Inf | 5.733258  | 10.660928 | abc    |
| 2 | F2   | A     | 9.215476  | 1.6077453 | Inf | 6.546601  | 12.972378 | bcd    |
| 5 | F2   | B     | 10.111425 | 1.7280226 | Inf | 7.233420  | 14.134519 | bcd    |
| 7 | F1   | C     | 13.256687 | 1.6832473 | Inf | 10.336060 | 17.002587 | cd     |
| 6 | F3   | B     | 17.417288 | 2.3205498 | Inf | 13.414447 | 22.614567 | de     |
| 1 | F1   | A     | 25.493630 | 2.5453331 | Inf | 20.962644 | 31.003968 | ef     |
| 3 | F3   | A     | 30.243972 | 3.6372401 | Inf | 23.892991 | 38.283104 | f      |
all_means_FL <- mod4 %>%
  emmeans(~ farm:lambclass) %>%
  cld(Letters = letters, type = "response")
all_means_FL

|    | farm | lambclass | rate      | SE        | df  | asymp.LCL  | asymp.UCL | .group |
|----|------|-----------|-----------|-----------|-----|------------|-----------|--------|
| 11 | F2   | L3        | 1.211285  | 0.6072527 | Inf | 0.4534336  | 3.235777  | a      |
| 12 | F3   | L3        | 1.541272  | 0.6330599 | Inf | 0.6890611  | 3.447471  | a      |
| 2  | F2   | L0        | 4.239496  | 1.1435635 | Inf | 2.4986805  | 7.193127  | ab     |
| 1  | F1   | L0        | 5.937368  | 1.3399401 | Inf | 3.8150091  | 9.240435  | abc    |
| 4  | F1   | L1        | 10.093526 | 1.7582109 | Inf | 7.1741503  | 14.200883 | bc     |
| 10 | F1   | L3        | 11.281000 | 1.8621120 | Inf | 8.1628719  | 15.590219 | bc     |
| 3  | F3   | L0        | 11.302658 | 1.7788181 | Inf | 8.3026754  | 15.386616 | bc     |
| 5  | F2   | L1        | 12.415666 | 1.9912184 | Inf | 9.0668149  | 17.001425 | c      |
| 8  | F2   | L2        | 36.338536 | 3.5724079 | Inf | 29.9699889 | 44.060382 | d      |
| 6  | F3   | L1        | 40.073061 | 3.6840219 | Inf | 33.4656540 | 47.985024 | d      |
| 9  | F3   | L2        | 46.495026 | 4.0443228 | Inf | 39.2071782 | 55.137541 | d      |
| 7  | F1   | L2        | 54.030052 | 4.3310308 | Inf | 46.1746233 | 63.221882 | d      |
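As the exercise comments suggest, the final step is to turn these estimated marginal means into figures. One possible sketch for the farm-by-breed results, assuming the all_means_FB object from above and that ggplot2 is loaded (the farm-by-lambclass plot would follow the same pattern with all_means_FL):

```r
# Estimated ewe counts (rate) by farm and breed, with 95% CIs and
# compact-letter groups; groups sharing a letter are not significantly different
ggplot(as.data.frame(all_means_FB),
       aes(x = farm, y = rate, fill = breed)) +
  geom_col(position = position_dodge(0.9), color = "black") +
  geom_errorbar(aes(ymin = asymp.LCL, ymax = asymp.UCL),
                position = position_dodge(0.9), width = 0.2) +
  geom_text(aes(label = trimws(.group), y = asymp.UCL),
            position = position_dodge(0.9), vjust = -0.5) +
  labs(x = "Farm", y = "Estimated count of ewes") +
  theme_classic() +
  theme(axis.title = element_text(face = "bold", size = 15),
        axis.text = element_text(face = "bold", size = 15))
```

Because type = "response" was used in cld(), the rates and confidence limits are already back-transformed to the count scale, so no further transformation is needed before plotting.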